1 INTRODUCTION


In this research investigation, the author first presents a criterion
of applicability of the K-Means clustering algorithm to a given data
set, based on the limit of variation of the results of several runs of
the algorithm. It should be noted that the results are data specific
and can change from data set to data set.

2 THE K-MEANS CLUSTERING ALGORITHM


2.1 The K-Means Clustering Algorithm 


K-means is one of the simplest unsupervised learning algorithms that
solve the well-known clustering problem. The procedure partitions a
given data set into a certain number of clusters (say K clusters) fixed
a priori. The main idea is to define K centers, one for each cluster.
These centers should be placed carefully, because different starting
locations lead to different results; a good choice is to place them as
far away from each other as possible. The next step is to take each
point of the data set and associate it with the nearest center. When no
point is pending, the first pass is complete and an initial grouping has
been obtained. At this point we re-calculate K new centroids as the
barycenters of the clusters resulting from the previous step. With these
K new centroids, a new assignment is made between the same data points
and the nearest new center. This produces a loop, during which the K
centers change their location step by step until no more changes occur,
that is, until the centers no longer move. In doing so, the algorithm
aims at minimizing an objective function known as the squared error
function.


There is a large body of literature on K-Means Clustering. Some good
sources are [1], [2], [3], and [4]. Cluster validation measures such as
SSW (Sum of Squares Within), the Elbow Plot and the Silhouette Plot are
also detailed well in these sources.


For a given dataset of points, K-Means Clustering aims at finding
clusters in the data. The clusters are found using the following
procedure:

For any clustering problem, the number of clusters must be given.
Therefore, let us assume that we need to find K clusters among the n
data points.


Step 1: First, we randomly pick K of the n data points to act as the
initial cluster centroids.

Step 2: To each cluster centroid, we assign the points that are nearer
to it than to any other cluster centroid.

Step 3: We then recompute the centroid of the points belonging to each
cluster after these assignments.

Step 4: We repeat the algorithm from Step 2 onwards until one of the
following holds:

a) the cluster centroids do not change anymore, i.e., they converge to
some values;
b) the points of each cluster remain in the same cluster;
c) the maximum number of iterations is reached. This number is selected
before the algorithm starts.
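Steps 1 to 4 can be sketched in a few lines of Python for the univariate case. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `kmeans_1d`, the `seed` parameter and the rule of keeping the old centroid when a cluster becomes empty are choices made here, not taken from the sources cited above.

```python
import random

def kmeans_1d(data, k, max_iter=100, seed=0):
    """Univariate K-Means following Steps 1-4; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Step 1: pick k of the data points (without replacement) as initial centroids.
    centroids = rng.sample(list(data), k)
    labels = [-1] * len(data)
    for _ in range(max_iter):  # stopping rule (c): cap on iterations
        # Step 2: assign each point to its nearest centroid.
        new_labels = [min(range(k), key=lambda j: abs(x - centroids[j])) for x in data]
        # Step 3: recompute each centroid as the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [x for x, lab in zip(data, new_labels) if lab == j]
            # Assumption: an empty cluster keeps its previous centroid.
            new_centroids.append(sum(members) / len(members) if members else centroids[j])
        # Step 4: stop when (a) centroids or (b) memberships no longer change.
        converged = new_centroids == centroids or new_labels == labels
        centroids, labels = new_centroids, new_labels
        if converged:
            break
    return centroids, labels
```

Calling `kmeans_1d(data, 2)` with different `seed` values is one way to obtain the several runs whose variation is studied in Section 3.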


2.2 The K-Means Clustering Objective Function 


The objective of K-Means clustering is to minimize the total
intra-cluster variance, i.e., the squared error function:

$$J = SSE = \sum_{j=1}^{K} \sum_{i=1}^{n_j} \left\| x_i^{(j)} - c_j \right\|^2$$

where $x_i^{(j)}$ is the $i$-th point of the $j$-th cluster, $c_j$ is
the centroid of the $j$-th cluster, $K$ is the number of clusters and
$n_j$ is the number of elements of the $j$-th cluster.

Also, $c_j$ is given by

$$c_j = \frac{1}{n_j} \sum_{i=1}^{n_j} x_i^{(j)}$$


2.3 Cluster Evaluation 
2.3.1 Compactness 


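Sum of squares within clusters (SSW), or within-cluster variance, is
given by

$$SSW = \sum_{j=1}^{K} \sum_{i=1}^{n_j} \left\| x_i^{(j)} - c_j \right\|^2$$

with the points, centroids, $K$ and $n_j$ as defined in Section 2.2.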
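As a small illustration of the SSW formula above, the following sketch computes it for univariate data given one cluster label per point. The function name `ssw` and the list-based input format are assumptions made for this example.

```python
def ssw(data, labels):
    """Sum of squared distances of each point to the mean of its own cluster."""
    total = 0.0
    for j in set(labels):
        members = [x for x, lab in zip(data, labels) if lab == j]
        centroid = sum(members) / len(members)  # cluster centroid = cluster mean
        total += sum((x - centroid) ** 2 for x in members)
    return total
```

For the points 1, 2, 3, 10, 12 split into the clusters {1, 2, 3} and {10, 12}, the centroids are 2 and 11, so SSW = (1 + 0 + 1) + (1 + 1) = 4.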


The index can only be used for numerical data because it requires the
centroids of the clusters. SSW measures the compactness of clusters and
is suitable for centroid-based clustering, where hyperspherical clusters
are desired. The value of SSW always decreases as the number of clusters
increases.

Fig: Illustration of Sum Of Squares Within Clusters
2.3.2 Dunn’s Index 


Many clustering algorithms require the number of clusters as an input
parameter. This is a potential problem, as this number is often unknown.
To overcome it, a number of cluster validation indices have been
proposed in the literature. A cluster validation index is, by
definition, a number that indicates the quality of a given clustering.
Hence, if the correct number of clusters is not known, one can execute a
clustering algorithm multiple times, varying the number of clusters in
each run from some minimum to some maximum value. For each clustering
obtained under this procedure, the validation indices are computed, and
the clustering that yields the best index value is returned as the final
result. A cluster validation measure such as Dunn’s index (Dunn 1973)
reflects the compactness, connectedness, and separation of cluster
partitions.


Dunn’s index ($V_D$) is the ratio of the minimum inter-cluster distance
to the maximum intra-cluster distance, and is computed as follows:

$$V_D = \min_{1 \le i \le K} \left\{ \min_{1 \le j \le K,\, j \ne i} \left\{ \frac{d(C_i, C_j)}{\max_{1 \le k \le K} \Delta(C_k)} \right\} \right\}$$

where

$$d(C_i, C_j) = \min_{x_i \in C_i,\; x_j \in C_j} d(x_i, x_j)$$

is the distance between clusters $C_i$ and $C_j$ (inter-cluster
distance), and

$$\Delta(C_k) = \max_{x_i, x_j \in C_k} d(x_i, x_j)$$

is the intra-cluster distance of cluster $C_k$. The value of $K$ for
which $V_D$ is maximized is taken as the optimal number of clusters.
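A short sketch of this index for univariate data follows. It assumes the absolute difference as the distance, at least two clusters, and at least one cluster with more than one distinct value (otherwise the maximum intra-cluster distance is zero and the ratio is undefined). The function name `dunn_index` is illustrative.

```python
def dunn_index(data, labels):
    """Smallest between-cluster distance divided by largest within-cluster distance."""
    clusters = {}
    for x, lab in zip(data, labels):
        clusters.setdefault(lab, []).append(x)
    groups = list(clusters.values())
    # Largest intra-cluster distance; in one dimension this is max - min of a cluster.
    max_diameter = max(max(g) - min(g) for g in groups)
    # Smallest distance between two points that lie in different clusters.
    min_separation = min(
        abs(x - y)
        for a in range(len(groups))
        for b in range(a + 1, len(groups))
        for x in groups[a]
        for y in groups[b]
    )
    return min_separation / max_diameter
```

For the points 0, 1, 5, 7 with clusters {0, 1} and {5, 7}, the largest intra-cluster distance is 2 and the smallest inter-cluster distance is 4, giving an index of 2.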


2.3.3 Silhouette Score 


The Silhouette Score measures how similar an object is to its own
cluster (cohesion) compared to other clusters (separation). Its values
range from -1 to +1. A high Silhouette Score indicates that the object
is well matched to its own cluster and poorly matched to the
neighbouring clusters.


In our study, we calculate the Silhouette Score using the Euclidean
distance metric. First, we compute the mean distance between
$i \in C_i$ (data point $i$ in the cluster $C_i$) and all other data
points in the same cluster, as

$$a(i) = \frac{1}{|C_i| - 1} \sum_{j \in C_i,\, j \ne i} d(i, j)$$

where $d(i, j)$ is the distance between data points $i$ and $j$ in the
cluster $C_i$, and $|C_i|$ is the number of data points in the cluster
$C_i$. We divide by $|C_i| - 1$ because the distance $d(i, i)$ is not
included in the sum.

The value $a(i)$ can be interpreted as a measure of how well $i$ belongs
to its cluster (the smaller the value, the better the fit).
We now compute the mean distance of point $i$ to some other cluster
$C_k$ as the mean of the distances from $i$ to all points in $C_k$.
That is, we compute

$$\frac{1}{|C_k|} \sum_{j \in C_k} d(i, j)$$

For each data point $i \in C_i$, we define

$$b(i) = \min_{k \ne i} \frac{1}{|C_k|} \sum_{j \in C_k} d(i, j)$$

to be the smallest mean distance of $i$ to all points in any other
cluster. The cluster with this smallest mean distance is said to be the
neighbouring cluster of $i$.


The Silhouette Score of one data point $i$ is defined as

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}, \quad \text{if } |C_i| > 1$$

and

$$s(i) = 0, \quad \text{if } |C_i| = 1$$
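The definitions of a(i), b(i) and s(i) translate directly into code. The sketch below is for univariate data, where the Euclidean distance reduces to the absolute difference. It assumes at least two clusters and no point whose a(i) and b(i) are both zero. The function name `silhouette_scores` is illustrative.

```python
def silhouette_scores(data, labels):
    """Return s(i) for every data point, following the definitions above."""
    clusters = {}
    for x, lab in zip(data, labels):
        clusters.setdefault(lab, []).append(x)
    scores = []
    for x, lab in zip(data, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # s(i) = 0 for a singleton cluster
            continue
        # a(i): mean distance to the other members; d(i, i) = 0 adds nothing to the sum.
        a = sum(abs(x - y) for y in own) / (len(own) - 1)
        # b(i): smallest mean distance to the points of any other cluster.
        b = min(
            sum(abs(x - y) for y in pts) / len(pts)
            for other, pts in clusters.items()
            if other != lab
        )
        scores.append((b - a) / max(a, b))
    return scores
```

For the points 0, 1, 5, 7 with clusters {0, 1} and {5, 7}, the point 0 has a = 1 and b = 6, so s = 5/6.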




2.3.4 The Elbow Plot 


For the K-means clustering method, the most common approach for
choosing the number of clusters is the so-called elbow method. It
involves running the algorithm multiple times in a loop, with an
increasing number of clusters, and then plotting a clustering score as a
function of the number of clusters. The score is, in general, the value
of the K-means objective function on the input data, i.e., some form of
intra-cluster distance relative to inter-cluster distance.

The elbow method gives us an idea of what a good number of clusters k
would be, based on the sum of squared distances (SSE) between data
points and their assigned clusters’ centroids. We pick k at the spot
where the SSE starts to flatten out, forming an elbow.
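The loop behind an elbow plot can be sketched as follows for univariate data. To keep the example deterministic, the centroids are started at evenly spaced order statistics of the sorted data. That is an assumption made here and differs from the random initialization described in Section 2.1. The names `sse_for_k` and `elbow_curve` are illustrative, and k must not exceed the number of data points.

```python
def sse_for_k(data, k, max_iter=100):
    """Run a compact univariate K-Means and return its final SSE."""
    xs = sorted(data)
    # Assumption: deterministic start at evenly spaced order statistics.
    centroids = [xs[(2 * j + 1) * len(xs) // (2 * k)] for j in range(k)]
    for _ in range(max_iter):
        groups = [[] for _ in range(k)]
        for x in xs:
            nearest = min(range(k), key=lambda j: abs(x - centroids[j]))
            groups[nearest].append(x)
        # An empty cluster keeps its previous centroid.
        updated = [sum(g) / len(g) if g else c for g, c in zip(groups, centroids)]
        if updated == centroids:
            break
        centroids = updated
    return sum(min((x - c) ** 2 for c in centroids) for x in xs)

def elbow_curve(data, k_max):
    """Map each k in 1..k_max to the SSE obtained with k clusters."""
    return {k: sse_for_k(data, k) for k in range(1, k_max + 1)}
```

Plotting the returned values against k and looking for the point after which the decrease becomes small gives the elbow.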


Fig: The elbow method for determining the number of clusters
2.4 Applications of K-Means Clustering


K-means is relatively efficient and fast. Its running time is O(tkn),
where n is the number of objects or points, k is the number of clusters
and t is the number of iterations.

K-means clustering can be applied in machine learning and data mining.

It is used on acoustic data in speech understanding to convert waveforms
into one of k categories (known as Vector Quantization), and for Image
Segmentation.

It is also used for choosing color palettes on old-fashioned graphical
display devices and for Image Quantization.

The K-means algorithm is useful for undirected knowledge discovery and
is relatively simple. K-means has found widespread usage in many fields,
including unsupervised learning of neural networks, pattern recognition,
classification analysis, artificial intelligence, image processing,
machine vision, and many others.
2.5 Disadvantages of the K-Means Clustering Algorithm


The main disadvantage of the K-Means Clustering algorithm is that it is
difficult to predict the value of K. Furthermore, different random
initializations of the centroids can result in different final clusters.
Also, if the original data has clusters of different sizes and different
densities, K-Means does not work well. Finally, the K-Means clustering
algorithm generally provides solutions that are only local optima for a
given data set. In summary:

It can take many iterations to converge
It is very sensitive to initialization
Random initialization can easily place two centers in the same cluster
K-means can get stuck in a local optimum

3 THE K-MEANS CLUSTERING ALGORITHM APPLICABILITY CRITERION


3.1 The Applicability Criterion 
This analysis is presented for the univariate case.

Let the data points be represented by $x_i$, $i = 1$ to $n$.

Let $x_{ij-Cavg}$ be the Cluster Average of the cluster to which $x_i$
belongs in the $j$-th run of the K-Means Clustering Algorithm.

Let $x_{ij-Cmax}$ be the Cluster Maximum of the cluster to which $x_i$
belongs in the $j$-th run of the K-Means Clustering Algorithm.

Let $x_{ij-Cmin}$ be the Cluster Minimum of the cluster to which $x_i$
belongs in the $j$-th run of the K-Means Clustering Algorithm.

Let $x_{ij-CSS}$ be the Cluster Silhouette Score of the cluster to which
$x_i$ belongs in the $j$-th run of the K-Means Clustering Algorithm.

Let $x_{ij-CSSW}$ be the Cluster Sum Of Squares Within of the cluster to
which $x_i$ belongs in the $j$-th run of the K-Means Clustering
Algorithm.
We now compute the Deviations

$$\delta_{ij-Cavg} = \frac{\sum_{j} x_{ij-Cavg}}{N} - \left( x_{ij-Cavg} \right)$$

where $N$ is the number of runs of the K-Means Clustering Algorithm.

Similarly, we compute

$$\delta_{ij-Cmax} = \frac{\sum_{j} x_{ij-Cmax}}{N} - \left( x_{ij-Cmax} \right)$$

$$\delta_{ij-Cmin} = \frac{\sum_{j} x_{ij-Cmin}}{N} - \left( x_{ij-Cmin} \right)$$

$$\delta_{ij-CSS} = \frac{\sum_{j} x_{ij-CSS}}{N} - \left( x_{ij-CSS} \right)$$

$$\delta_{ij-CSSW} = \frac{\sum_{j} x_{ij-CSSW}}{N} - \left( x_{ij-CSSW} \right)$$


We now Min-Max normalize to the range [0, 1] each of the above sets of
values $\delta_{ij-Cavg}$, $\delta_{ij-Cmax}$, $\delta_{ij-Cmin}$,
$\delta_{ij-CSS}$, $\delta_{ij-CSSW}$, separately for each set.

Let these normalized values be represented by $\bar{\delta}_{ij-Cavg}$,
$\bar{\delta}_{ij-Cmax}$, $\bar{\delta}_{ij-Cmin}$,
$\bar{\delta}_{ij-CSS}$, $\bar{\delta}_{ij-CSSW}$.
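The two steps above can be sketched as follows for one of the five quantities (for example the cluster average). The sketch assumes that each deviation is the mean over the N runs minus the value in the given run, and that a set whose values are all equal is mapped to zeros by the normalization. The nested-list layout `values_per_run[j][i]` and both function names are choices made for this example.

```python
def deviations(values_per_run):
    """values_per_run[j][i] holds the value for point i in run j; returns delta[j][i]."""
    n_runs = len(values_per_run)
    n_points = len(values_per_run[0])
    # Mean of the quantity over all runs, separately for each data point.
    means = [sum(run[i] for run in values_per_run) / n_runs for i in range(n_points)]
    return [[means[i] - run[i] for i in range(n_points)] for run in values_per_run]

def min_max_normalize(table):
    """Rescale every entry of one set of values to the range [0, 1]."""
    flat = [v for row in table for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # Assumption: a constant set is mapped to zeros to avoid dividing by zero.
        return [[0.0 for _ in row] for row in table]
    return [[(v - lo) / (hi - lo) for v in row] for row in table]
```

With two runs and two points, `deviations([[1.0, 10.0], [3.0, 10.0]])` gives `[[1.0, 0.0], [-1.0, 0.0]]`, which normalizes to `[[1.0, 0.5], [0.0, 0.5]]`.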
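We also compute the (Sample) Standard Deviations of these normalized
sets of values (written $\bar{\delta}$ below), separately for each set,
as:

$$\sigma_{(ij-Cavg)S} = \left[ \frac{\sum_{j} \left( \frac{\sum_{j} \bar{\delta}_{ij-Cavg}}{N} - \bar{\delta}_{ij-Cavg} \right)^2}{N-1} \right]^{1/2}$$

$$\sigma_{(ij-Cmax)S} = \left[ \frac{\sum_{j} \left( \frac{\sum_{j} \bar{\delta}_{ij-Cmax}}{N} - \bar{\delta}_{ij-Cmax} \right)^2}{N-1} \right]^{1/2}$$

$$\sigma_{(ij-Cmin)S} = \left[ \frac{\sum_{j} \left( \frac{\sum_{j} \bar{\delta}_{ij-Cmin}}{N} - \bar{\delta}_{ij-Cmin} \right)^2}{N-1} \right]^{1/2}$$

$$\sigma_{(ij-CSS)S} = \left[ \frac{\sum_{j} \left( \frac{\sum_{j} \bar{\delta}_{ij-CSS}}{N} - \bar{\delta}_{ij-CSS} \right)^2}{N-1} \right]^{1/2}$$

$$\sigma_{(ij-CSSW)S} = \left[ \frac{\sum_{j} \left( \frac{\sum_{j} \bar{\delta}_{ij-CSSW}}{N} - \bar{\delta}_{ij-CSSW} \right)^2}{N-1} \right]^{1/2}$$
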
As the total number of unique results possible in a K-Means Clustering
Algorithm for making $K$ clusters with $n$ data points is given by
$m = {}^{n}C_{K}$, the Population Standard Deviations corresponding to
the above Sample Standard Deviations are given by

$$\sigma_{(ij-Cavg)Pop} = \sqrt{m} \cdot \sigma_{(ij-Cavg)S}$$

$$\sigma_{(ij-Cmax)Pop} = \sqrt{m} \cdot \sigma_{(ij-Cmax)S}$$

$$\sigma_{(ij-Cmin)Pop} = \sqrt{m} \cdot \sigma_{(ij-Cmin)S}$$

$$\sigma_{(ij-CSS)Pop} = \sqrt{m} \cdot \sigma_{(ij-CSS)S}$$

$$\sigma_{(ij-CSSW)Pop} = \sqrt{m} \cdot \sigma_{(ij-CSSW)S}$$
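As a rough sketch of these two computations, the code below takes the normalized deviations of one data point over the N runs, computes their sample standard deviation with the N - 1 divisor, and scales it by the square root of m as stated in the text. The function names and the argument layout are assumptions made for illustration. `math.comb` requires Python 3.8 or later.

```python
import math

def sample_std(values):
    """Sample standard deviation (divisor N - 1) of one set of normalized deviations."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))

def population_std(values, n_points, k):
    """Scale the sample standard deviation by sqrt(m), with m = C(n_points, k)."""
    m = math.comb(n_points, k)  # number of unique results, as given in the text
    return math.sqrt(m) * sample_std(values)
```

For the values 0, 0.5, 1 the sample standard deviation is 0.5. With n = 4 points and K = 2 clusters, m = 6 and the scaled value is 0.5 times the square root of 6.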

We now Min-Max normalize to the range [0, 1] each of the above sets of
values $\sigma_{(ij-Cavg)Pop}$, $\sigma_{(ij-Cmax)Pop}$,
$\sigma_{(ij-Cmin)Pop}$, $\sigma_{(ij-CSS)Pop}$,
$\sigma_{(ij-CSSW)Pop}$, separately for each set. Let these normalized
values be represented by $\bar{\sigma}_{(ij-Cavg)Pop}$,
$\bar{\sigma}_{(ij-Cmax)Pop}$, $\bar{\sigma}_{(ij-Cmin)Pop}$,
$\bar{\sigma}_{(ij-CSS)Pop}$, $\bar{\sigma}_{(ij-CSSW)Pop}$.


We now find a weighted average of the normalized values
$\bar{\sigma}_{(ij-Cavg)Pop}$, $\bar{\sigma}_{(ij-Cmax)Pop}$,
$\bar{\sigma}_{(ij-Cmin)Pop}$, $\bar{\sigma}_{(ij-CSS)Pop}$,
$\bar{\sigma}_{(ij-CSSW)Pop}$ to obtain the variation of the results of
the K-Means Clustering Algorithm run $N$ times. Let this weighted
average be denoted by $v$. We can then take $r = (1 - v)$ as the
Coefficient Of Robustness of the results of the K-Means Clustering
Algorithm for a given data set. The advantage of this value is that it
flags unreliable results: if the variation of the results is high among
the $N$ runs of the K-Means Clustering Algorithm for the given data,
then $r$ is low and the results are not acceptable for reporting.
Therefore, we can use this concept to specify a required value of $r$
for each data set on which the K-Means Clustering Algorithm is run, so
that the results are acceptable. Furthermore, for a given $r$, we can
even compute the required $N$.
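The final step can be sketched as below. The text does not specify the weights, so equal weights are used by default. That default, the dictionary keys and the function name `robustness` are assumptions made for this example. The inputs are taken to be the five normalized values for one data point.

```python
def robustness(normalized_sigmas, weights=None):
    """Return r = 1 - v, where v is the weighted average of the normalized values."""
    names = list(normalized_sigmas)
    if weights is None:
        # Assumption: equal weights, since the weighting scheme is not specified.
        weights = {name: 1.0 for name in names}
    total_weight = sum(weights[name] for name in names)
    v = sum(weights[name] * normalized_sigmas[name] for name in names) / total_weight
    return 1.0 - v
```

For normalized values 0.2, 0.4, 0.0, 0.1 and 0.3 with equal weights, v = 0.2 and r = 0.8. A threshold on r could then be used to decide whether the clustering results are stable enough to report.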